INTERSPEECH 2010 - Speech Processing

Total: 75

#1 A factorial sparse coder model for single channel source separation

Authors: Robert Peharz ; Michael Stark ; Franz Pernkopf ; Yannis Stylianou

We propose a probabilistic factorial sparse coder model for single channel source separation in the magnitude spectrogram domain. The mixture spectrogram is assumed to be the sum of the sources, which are assumed to be generated frame-wise as the output of sparse coders plus noise. For dictionary training we use an algorithm which can be described as non-negative matrix factorization with l0 sparseness constraints. In order to infer likely source spectrogram candidates, we approximate the intractable exact inference by maximizing the posterior over a plausible subset of solutions. We compare our system to the factorial-max vector quantization model, where the proposed method shows superior performance in terms of signal-to-interference ratio. Finally, the low computational requirements of the algorithm allow close-to-real-time applications.
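As a rough illustration of the dictionary-training idea, the following sketch runs plain multiplicative-update NMF on a magnitude spectrogram and then applies an l0-style projection that keeps only the `k_active` largest activations per frame. This is illustrative only; the authors' algorithm enforces the sparseness constraint during optimization rather than as a final projection, and all parameter names here are hypothetical.

```python
import numpy as np

def nmf_l0(V, rank, k_active, n_iter=100, eps=1e-9):
    """Sketch: multiplicative-update NMF plus an l0-style projection
    keeping only k_active activations per frame (not the exact
    training algorithm from the paper)."""
    rng = np.random.default_rng(0)
    F, T = V.shape
    W = rng.random((F, rank)) + eps   # dictionary (spectral atoms)
    H = rng.random((rank, T)) + eps   # frame-wise activations
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
        W /= W.sum(axis=0, keepdims=True)   # normalize atoms
    # l0 projection: zero all but the k_active largest entries per column
    drop = np.argsort(H, axis=0)[:rank - k_active, :]
    np.put_along_axis(H, drop, 0.0, axis=0)
    return W, H
```

After the projection, each spectrogram frame is approximated by at most `k_active` dictionary atoms, which is the kind of frame-wise sparse code the factorial model combines across sources.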

#2 Oriented PCA method for blind speech separation of convolutive mixtures

Authors: Yasmina Benabderrahmane ; Sid Ahmed Selouani ; Douglas O'Shaughnessy

This paper deals with blind speech separation of convolutive mixtures of sources. The separation criterion is based on Oriented Principal Component Analysis (OPCA) in the frequency domain. OPCA is a (second order) extension of standard Principal Component Analysis (PCA) aiming at maximizing the power ratio of a pair of signals. The convolutive mixing is obtained by modeling the Head Related Transfer Function (HRTF). Experimental results show the efficiency of the proposed approach in terms of subjective and objective evaluation, when compared to the Degenerate Unmixing Estimation Technique (DUET) and the widely used C-FICA (Convolutive Fast-ICA) algorithm.
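Maximizing the power ratio of a pair of signals reduces to a generalized eigenvalue problem. The sketch below shows that core step in the time domain on multichannel covariance matrices; the paper applies the idea per frequency bin, and the helper name and shapes here are assumptions.

```python
import numpy as np
from scipy.linalg import eigh

def opca_filter(X_signal, X_noise):
    """Sketch of Oriented PCA: find the projection w maximizing the
    power ratio (w' Rs w) / (w' Rn w) via a generalized eigenproblem.
    X_* are (channels x samples) arrays; hypothetical helper, not the
    authors' frequency-domain implementation."""
    Rs = X_signal @ X_signal.T / X_signal.shape[1]
    Rn = X_noise @ X_noise.T / X_noise.shape[1]
    _, evecs = eigh(Rs, Rn)     # generalized eigvecs, ascending eigvals
    return evecs[:, -1]         # direction of maximal power ratio
```

The top generalized eigenvector points along the direction where the first signal's power dominates the second's, which is what the OPCA separation criterion exploits.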

#3 Online Gaussian process for nonstationary speech separation

Authors: Hsin-Lung Hsieh ; Jen-Tzung Chien

In a practical speech enhancement system, it is necessary to recover speech signals from mixed signals corrupted by nonstationary source signals and mixing conditions. The source voices may come from different moving speakers. The speakers may abruptly appear or disappear and may be permuted continuously. To deal with these scenarios with a varying number of sources, we present a new method for nonstationary speech separation. An online Gaussian process independent component analysis (OLGP-ICA) is developed to characterize the real-time temporal structure in a time-varying mixing system and to capture the evolved statistics of independent sources from online observed signals. A variational Bayes algorithm is established to estimate the evolved parameters for dynamic source separation. In the experiments, the proposed OLGP-ICA is compared with other ICA methods and is shown to be effective in recovering speech and music signals in a nonstationary speaking environment.

#4 Convexity and fast speech extraction by split Bregman method

Authors: Meng Yu ; Wenye Ma ; Jack Xin ; Stanley Osher

A fast speech extraction (FSE) method is presented using convex optimization made possible by pause detection of the speech sources. Sparse unmixing filters are sought by L1 regularization and the split Bregman method. A subdivided split Bregman method is developed for efficiently estimating long reverberations in real room recordings. The speech pause detection is based on a binary mask source separation method. The FSE method is evaluated and found to outperform existing blind speech separation approaches on both synthetic and room recorded data in terms of the overall computational speed and separation quality.
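The L1 + split Bregman machinery can be sketched on a generic problem of the form min 0.5||Ax-b||^2 + lam||x||_1. This is a simplification: the paper estimates sparse unmixing filters with a subdivided variant for long reverberation, and the parameter values below are hypothetical.

```python
import numpy as np

def shrink(v, t):
    """Soft-thresholding (the proximal step for the l1 term)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def split_bregman_l1(A, b, lam=0.1, mu=1.0, n_iter=200):
    """Sketch of split Bregman for min 0.5||Ax-b||^2 + lam||x||_1,
    the generic l1 solver the paper builds on."""
    n = A.shape[1]
    x = np.zeros(n); d = np.zeros(n); c = np.zeros(n)
    AtA = A.T @ A + mu * np.eye(n)
    Atb = A.T @ b
    for _ in range(n_iter):
        # x-update: quadratic subproblem, a single linear solve
        x = np.linalg.solve(AtA, Atb + mu * (d - c))
        # d-update: shrinkage enforces sparsity
        d = shrink(x + c, lam / mu)
        # Bregman variable update
        c = c + x - d
    return d
```

Each iteration alternates a cheap linear solve with an elementwise shrinkage, which is why split Bregman methods are fast enough for the long sparse filters the paper targets.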

#5 Reducing musical noise in blind source separation by time-domain sparse filters and split Bregman method

Authors: Wenye Ma ; Meng Yu ; Jack Xin ; Stanley Osher

Musical noise often arises in the outputs of time-frequency binary mask based blind source separation approaches. Post-processing is desired to enhance the separation quality. An efficient musical noise reduction method by time-domain sparse filters is presented using convex optimization. The sparse filters are sought by L1 regularization and the split Bregman method. The proposed musical noise reduction method is evaluated by both synthetic and room recorded speech and music data, and found to outperform existing musical noise reduction methods in terms of the objective and subjective measures.

#6 Combining monaural and binaural evidence for reverberant speech segregation

Authors: John Woodruff ; Rohit Prabhavalkar ; Eric Fosler-Lussier ; DeLiang Wang

Most existing binaural approaches to speech segregation rely on spatial filtering. In environments with minimal reverberation and when sources are well separated in space, spatial filtering can achieve excellent results. However, in everyday environments performance degrades substantially. To address these limitations, we incorporate monaural analysis within a binaural segregation system. We use monaural cues to perform both local and across frequency grouping of mixture components, allowing for a more robust application of spatial filtering. We propose a novel framework in which we combine monaural grouping evidence and binaural localization evidence in a linear model for the estimation of the ideal binary mask. Results indicate that with appropriately designed features that capture both monaural and binaural evidence, an extremely simple model achieves a signal-to-noise ratio improvement of up to 4 dB relative to using spatial filtering alone.
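A minimal version of the linear combination model might look like the following: per time-frequency unit, stacked monaural and binaural features are passed through a logistic-linear scorer whose thresholded output is the estimated binary mask. The weights here are hypothetical stand-ins for parameters that would be learned on training mixtures.

```python
import numpy as np

def estimate_binary_mask(features, weights, bias):
    """Sketch: estimate the ideal binary mask per time-frequency unit
    from stacked monaural + binaural features with a linear (logistic)
    model. Weights/bias are hypothetical learned parameters."""
    score = features @ weights + bias        # (units, n_feat) @ (n_feat,)
    prob = 1.0 / (1.0 + np.exp(-score))      # P(target dominates the unit)
    return prob > 0.5                        # binary mask decision
```

The appeal of such a model is exactly what the abstract notes: once the features capture both kinds of evidence, the combiner itself can be extremely simple.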

#7 Multichannel noise reduction using low order RTF estimate

Authors: Subhojit Chakladar ; Nam Soo Kim ; Yu Gwang Jin ; Tae Gyoon Kang

The relative transfer function generalized sidelobe canceler (RTF-GSC) is a popular method for implementing multichannel speech enhancement. However, accurate estimation of channel transfer function ratios poses a challenge, especially in noisy environments. In this work, we demonstrate that even a very low order RTF estimate can give superior performance in terms of noise reduction without incurring excessive speech distortion. We show that noise reduction depends on the correlation between the input noise and the noise reference generated by the Blocking Matrix (BM), and that a low order RTF estimate preserves this correlation better than a high order one. The performance of both high order and low order RTF estimates is compared using output SNR, noise reduction, and a perceptual measure of speech quality.

#8 Reinforced blocking matrix with cross channel projection for speech enhancement

Authors: Inho Lee ; Jongsung Yoon ; Yoonjae Lee ; Hanseok Ko

In this paper, we propose a reinforced Blocking Matrix for TF-GSC by incorporating a cross channel projection for speech enhancement. The Transfer Function GSC (TF-GSC) proposed by Gannot was aimed at improving speech quality, but the desired speech signal becomes somewhat distorted since the reference signal produced by the blocking matrix contains a significant amount of the desired signal. The proposed reinforcement of the Blocking Matrix is a scheme to remove the highly correlated components between the inter-channel reference signals using orthogonal projection, thereby completely eliminating the desired signal. Representative experiments show that the proposed scheme is effective, and its strength is demonstrated in terms of improved average signal-to-noise ratio (SNR) and log spectral distance (LSD).

#9 Masking property based microphone array post-filter design

Authors: Ning Cheng ; Wenju Liu ; Lan Wang

This paper presents a novel post-filter for noise reduction. A subspace based noise estimation method is developed with the use of multiple statistical distributions to model the speech and noise. The signal-plus-noise subspace dimension is determined by maximizing the target speech presence probability in noisy frames, so as to estimate the noise power spectrum for post-filter design. Then, masking property is incorporated in the post-filter technique for residual noise shaping. Experimental results show that the proposed scheme outperforms the baseline systems in terms of various quality measurements of the enhanced speech.

#10 Reduction of broadband noise in speech signals by multilinear subspace analysis

Authors: Yusuke Sato ; Tetsuya Hoya ; Hovagim Bakardjian ; Andrzej Cichocki

A new noise reduction method for speech signals is proposed in this paper. The method is based upon the N-mode singular value decomposition algorithm, which exploits the multilinear subspace analysis of given speech data. Simulation results using both synthetically generated and real broadband noise components show that the enhancement quality obtained by the multilinear subspace analysis method in terms of both segmental gain and cepstral distance, as well as informal listening tests, is superior to that by a conventional nonlinear spectral subtraction method and the previously proposed approach based upon sliding subspace projection.

#11 Novel probabilistic control of noise reduction for improved microphone array beamforming

Authors: Jungpyo Hong ; Seungho Han ; Sangbae Jeong ; Minsoo Hahn

In this paper, a novel speech enhancement algorithm is proposed. The algorithm controls the amount of noise reduction according to whether speech is absent or present in noisy environments. Based on the estimated speech absence probability (SAP), the amount of noise reduction is adaptively controlled. To calculate the SAP, the normalized cross correlation of linear predictive residual signals is utilized instead of that of the original input signals. This is especially robust and effective in reverberant and realistic environments. Experimental results show that the proposed algorithm improves speech recognition rates compared with conventional linearly constrained minimum variance beamforming.

#12 Speech enhancement using improved generalized sidelobe canceller in frequency domain with multi-channel postfiltering

Authors: Kai Li ; Qiang Fu ; Yonghong Yan

In this paper, we propose a speech enhancement algorithm featuring interaction between adaptive beamforming and a multi-channel postfilter. A novel subband feedback controller based on speech presence probability is applied to the Generalized Sidelobe Canceller algorithm to obtain more robust adaptive beamforming in adverse environments and alleviate the problem of signal cancellation. A multi-channel postfilter is used not only to further suppress diffuse noises and some transient interferences, but also to provide the speech presence probability in each subband. Experimental results show that the proposed algorithm achieves considerable improvement in preserving the desired speech signal in adverse noise environments, consisting of both directional and diffuse noises, over the comparative algorithms.

#13 Close speaker cancellation for suppression of non-stationary background noise for hands-free speech interface

Authors: Jani Even ; Carlos Ishi ; Hiroshi Saruwatari ; Norihiro Hagita

This paper presents a noise cancellation method based on the ability to efficiently cancel a close target speaker's contribution from the signals observed at a microphone array. The proposed method exploits this specificity in the case of a hands-free speech interface, and is in particular able to deal with non-stationary noise. The method can be divided into three steps. First, the steering vector pointing at the target user is estimated from the covariance of the observed signals. Then the noise estimate is obtained by cancelling the user's contribution; during this step the speech pauses are also estimated. Finally, a post-filter is used to suppress this estimated noise from the observed signals. The post-filter strength is controlled by using the estimated noise during the speech pauses as a reference. A 20k-word dictation task in the presence of non-stationary diffuse background noise at different SNR levels illustrates the effectiveness of the proposed method.

#14 Multi-channel iterative dereverberation based on codebook constrained iterative multi-channel Wiener filter

Authors: Ajay Srinivasamurthy ; Thippur V. Sreenivas

A novel Multi-channel Iterative Dereverberation (MID) algorithm based on the Codebook Constrained Iterative Multi-channel Wiener Filter (CCIMWF) is proposed. We extend the classical iterative Wiener filter (IWF) to the multi-channel dereverberation case. The late reverberations are estimated using Long-term Multi-step Linear Prediction (LTMLP). This estimate is used in the CCIMWF framework through a doubly iterative formulation. A clean speech VQ codebook is effective for inducing intra-frame constraints and improving the convergence of the IWF; thus, a joint-CCIMWF algorithm is proposed for the multi-channel case. The signal to reverberation ratio (SRR) and log spectral distortion (LSD) measures improve through the double iterations, showing that the algorithm suppresses the effect of late reverberations and improves speech quality and intelligibility. The algorithm also has fair convergence properties through the iterations.

#15 Speaker-dependent mapping of source and system features for enhancement of throat microphone speech

Authors: Anand Joseph Xavier Medabalimi ; Sri Harish Reddy Mallidi ; Bayya Yegnanarayana

A throat microphone (TM) produces speech which is perceptually poorer than close speaking microphone (CSM) speech. Many attempts at improving the quality of TM speech have been made by mapping the features corresponding to the vocal tract system. These techniques are limited by the methods used to generate the excitation signal. In this paper a method to map the source (excitation) using multilayer feed-forward neural networks is proposed for voiced segments. This method anchors the analysis windows at the regions around the instants of glottal closure, so that the non-linear characteristics of the TM and CSM signals in these regions are emphasized in the mapping process. The features obtained from these regions for both TM and CSM speech are used to train a MLFFNN to capture the non-linear relation between them. An improved technique for mapping the system features is also proposed. Speech synthesized using the proposed techniques was evaluated through subjective tests and was found to be significantly better than TM speech.

#16 An analytic modeling approach to enhancing throat microphone speech commands for keyword spotting

Authors: Jun Cai ; Stefano Marini ; Pierre Malarme ; Francis Grenez ; Jean Schoentgen

This research was carried out on enhancing throat microphone speech for noise-robust speech keyword spotting. The enhancement was performed by mapping the log-energy in the Mel-frequency bands of throat microphone speech to those of the corresponding close-talk microphone speech. An analytic equation detection system, Eureqa, which can infer nonlinear relations directly from observed data, was used to identify the enhancement models. Speech recognition experiments with the enhanced throat microphone speech keywords indicate that the analytic enhancement models performed well in terms of recognition accuracy. Unvoiced consonants, however, could not be enhanced well enough, mostly because they were not effectively recorded by the throat microphone.

#17 Single-channel speech enhancement using Kalman filtering in the modulation domain

Authors: Stephen So ; Kamil K. Wójcicki ; Kuldip K. Paliwal

In this paper, we propose the modulation-domain Kalman filter (MDKF) for speech enhancement. In contrast to previous modulation-domain enhancement methods based on bandpass filtering, the MDKF is an adaptive and linear MMSE estimator that uses models of the temporal changes of the magnitude spectrum for both speech and noise. Also, because the Kalman filter is a joint magnitude and phase spectrum estimator under non-stationarity assumptions, it is highly suited to modulation-domain processing, as modulation phase tends to contain more speech information than acoustic phase. Experimental results on the NOIZEUS corpus show the ideal MDKF (with clean speech parameters) to outperform all the acoustic and time-domain enhancement methods that were evaluated, including the conventional time-domain Kalman filter with clean speech parameters. A practical MDKF that uses the MMSE-STSA method to enhance noisy speech in the acoustic domain prior to LPC analysis was also evaluated and showed promising results.
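The core recursion behind such a filter can be sketched per frequency bin: the magnitude trajectory across frames is treated as the state of a linear dynamic model and filtered with the standard predict/update equations. The AR(1) state model and the fixed parameters below are simplifying assumptions; the full MDKF derives its models from speech and noise statistics.

```python
import numpy as np

def kalman_track(y, a=0.95, q=0.01, r=0.1):
    """Sketch: scalar Kalman filtering of one frequency bin's magnitude
    trajectory across frames (AR(1) state model; a, q, r are
    hypothetical parameters the full method would estimate)."""
    x, p = y[0], 1.0
    out = np.empty_like(y)
    for t, obs in enumerate(y):
        # predict step: propagate state and uncertainty
        x_pred, p_pred = a * x, a * a * p + q
        # update step: blend prediction with the noisy observation
        k = p_pred / (p_pred + r)
        x = x_pred + k * (obs - x_pred)
        p = (1.0 - k) * p_pred
        out[t] = x
    return out
```

Applied to every bin's modulation signal, this kind of recursion smooths the noisy magnitude envelope while adapting to its temporal dynamics, which is the intuition behind moving the Kalman filter into the modulation domain.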

#18 Integrated feedback and noise reduction algorithm in digital hearing aids via oscillation detection

Authors: Miao Yao ; Weiqian Liang

In this paper, an integrated feedback and noise reduction scheme for hearing aids is developed. The technique presented is based on adaptive feedback cancellation (AFC) and a general sidelobe canceller (GSC), with a band-limited adaptation method to improve the convergence behavior of both AFC and GSC. A band pass pre-filter is applied to the AFC and a band stop pre-filter is applied to the GSC to increase the portion of the desired signal. An oscillation detector based on the zero crossing rates of the autocorrelation of sub-band signals is designed to calculate the center frequency of the oscillation, making the band-limited adaptation more robust. Convergence analysis and computer simulation illustrate that the proposed algorithm effectively reduces both feedback and noise.

#19 A blind signal-to-noise ratio estimator for high noise speech recordings

Authors: Charles Mercier ; Roch Lefebvre

Blind estimation of the signal-to-noise ratio in noisy speech recordings is useful to enhance the performance of many speech processing algorithms. Most current techniques are efficient in low noise environments only, justifying the need for a high noise estimator, such as the one presented here. A pitch tracker robust in high noise was developed and is used to create a two-dimensional representation of the audio input. Signal-to-noise ratio estimation is then performed using an image processing algorithm, effectively combining the short-term and long-term properties of speech. The proposed technique is shown to perform accurately even in high noise situations.

#20 Fast converging iterative Kalman filtering for speech enhancement using long and overlapped tapered windows with large side lobe attenuation

Authors: Stephen So ; Kuldip K. Paliwal

In this paper, we propose an iterative Kalman filtering scheme that has faster convergence and introduces less residual noise when compared with the iterative scheme of Gibson et al. This is achieved via the use of long and overlapped frames, as well as a tapered window with large side lobe attenuation for linear prediction analysis. We show that the Dolph-Chebyshev window with a -200 dB side lobe attenuation tends to enhance the dynamic range of the formant structure of speech corrupted with white noise, reduce prediction error variance bias, and provide some spectral smoothing, while the long overlapped frames provide reliable autocorrelation estimates and temporal smoothing. Speech enhancement experiments on the NOIZEUS corpus show that the proposed method outperformed conventional iterative and non-iterative Kalman filters as well as other enhancement methods such as MMSE-STSA and PSC.
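The windowed LPC analysis step described above can be sketched as follows: a Dolph-Chebyshev taper with very large side-lobe attenuation is applied to a long frame before autocorrelation-method LPC. The frame length and LPC order are illustrative choices, not the paper's settings.

```python
import numpy as np
from scipy.signal.windows import chebwin

FRAME_LEN = 512
window = chebwin(FRAME_LEN, at=200)      # -200 dB side-lobe attenuation

def levinson(r, order):
    """Levinson-Durbin recursion (standard textbook form)."""
    a = np.zeros(order + 1)
    a[0] = 1.0
    e = r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / e
        a_prev = a.copy()
        for j in range(1, i):
            a[j] = a_prev[j] + k * a_prev[i - j]
        a[i] = k
        e *= (1.0 - k * k)
    return a, e

def lpc_from_frame(frame, order=10):
    """Autocorrelation-method LPC on a tapered analysis frame."""
    x = frame * window
    r = np.correlate(x, x, mode="full")[len(x) - 1:len(x) + order]
    return levinson(r, order)
```

The heavy taper suppresses spectral leakage from strong formants into low-energy regions, which is what the paper credits for the improved dynamic range and reduced prediction-error bias.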

#21 Robust noise estimation using minimum correction with harmonicity control

Authors: Xuejing Sun ; Kuan-Chieh Yen ; Rogerio Alves

In this paper a new noise spectrum estimation algorithm is described for single-channel acoustic noise suppression systems. To achieve fast convergence during abrupt change of noise floor, the proposed algorithm uses a minimum correction module to adjust an adaptive noise estimator. The minimum search duration is controlled by a harmonicity module for improved noise tracking under continuous voicing condition. Objective test results show that the proposed algorithm consistently outperforms competitive noise estimation methods.
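A bare-bones version of the minimum-correction idea, for a single frequency bin, might look like this: a recursively smoothed power estimate is clamped toward the sliding-window minimum so the tracker recovers quickly after speech activity. The smoothing constant, window length, and correction factor are hypothetical, and the paper additionally gates the minimum search with a harmonicity measure.

```python
import numpy as np

def track_noise(power, alpha=0.95, win=50):
    """Sketch: minimum-correction noise tracking for one frequency bin
    (alpha, win, and the 1.5x correction bound are hypothetical)."""
    est = power[0]
    out = np.empty_like(power)
    for t in range(len(power)):
        est = alpha * est + (1.0 - alpha) * power[t]   # recursive smoothing
        lo = power[max(0, t - win + 1):t + 1].min()    # sliding minimum
        est = min(est, lo * 1.5)                       # minimum correction
        out[t] = est
    return out
```

Because the sliding minimum ignores short high-power speech bursts, the correction keeps the estimate near the true noise floor during voicing instead of letting the smoother drift upward.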

#22 New insights into subspace noise tracking

Author: Mahdi Triki

Various speech enhancement techniques rely on the knowledge of the clean signal and noise statistics. In practice, however, these statistics are not explicitly available, and the overall enhancement accuracy critically depends on the estimation quality of the unknown statistics. The estimation of noise (and speech) statistics is particularly challenging under non-stationary noise conditions. In this respect, subspace-based approaches have been shown to provide a good tracking vs. final misadjustment tradeoff. Subspace-based techniques hinge critically on both rank-limited and spherical assumptions of the speech and the noise DFT matrices, respectively. The speech rank-limited assumption was previously experimentally tested and validated. In this paper, we will investigate the structure of nuisance sources. We will discuss the validity of the spherical assumption for a variety of nuisance sources (environmental noise, reverberation), and preprocessing (overlapping segmentation).

#23 Bias considerations for minimum subspace noise tracking

Authors: Mahdi Triki ; Kees Janse

Speech enhancement schemes generally rely on knowledge of the noise power spectral density. The estimation of these statistics is a particularly critical issue and a challenging problem under non-stationary noise conditions. In this respect, subspace-based approaches have been shown to allow for reduced estimation delay and to provide a good tracking vs. final misadjustment tradeoff. One key attribute for noise floor tracking is the estimation bias: overestimation leads to over-suppression and more speech distortion, while underestimation leads to a high level of residual noise. The present paper investigates the bias of the subspace-based scheme, and particularly the robustness of the bias compensation factor to the desired speaker characteristics and the input SNR.

#24 A corpus-based approach to speech enhancement from nonstationary noise

Authors: Ji Ming ; Ramji Srinivasan ; Danny Crookes

This paper addresses single-channel speech enhancement assuming difficulties in predicting the noise statistics. We describe an approach which aims to maximally extract the two features of speech - its temporal dynamics and speaker characteristics - to improve the noise immunity. This is achieved by recognizing long speech segments as whole units from noise. In the recognition, clean speech sentences, taken from a speech corpus, are used as examples. Experiments have been conducted on the TIMIT database for separating various types of nonstationary noise including song, music, and crosstalk speech. The new approach has demonstrated improved performance over conventional speech enhancement algorithms in both objective and subjective evaluations.

#25 Bandwidth expansion of speech based on wavelet transform modulus maxima vector mapping

Authors: Zhe Chen ; You-Chi Cheng ; Fuliang Yin ; Chin-Hui Lee

A novel approach to speech bandwidth expansion based on wavelet transform modulus maxima vector mapping is proposed. By taking advantage of the similarity of the modulus maxima vectors between narrowband and wideband wavelet-analyzed signals a neural network mapping structure can be established to perform bandwidth expansion given only the narrowband version of speech. Since the proposed algorithm works on the time-domain waveforms it offers a flexibility of variable-length frame selection that facilitates low delay and potentially data-dependent speech segment processing to further improve the speech quality. Evaluations based on both objective and subjective measures show that the proposed bandwidth expansion approach results in high-quality synthesized wideband speech with little perceivable distortion from the original wideband speech signals.